I wrote this as a fun little unofficial project to practice my data science skills, looking at NBA scheduling patterns and their relationship to travel and rest. I used historical seasons and a draft portion of the 2024–25 schedule, focusing on density markers (e.g., 4 games in 6 nights), back-to-backs, rest-day distributions, cross-timezone movement, and travel distances. I also built an interactive schedule view as visualization practice. Finally, I ran a small machine learning exercise to quantify schedule-related effects on win probability.
rm(list = ls())
options(scipen = 999)
library(tidyverse)
library(here)
library(lubridate)
library(slider)
options(warn = -1)
suppressPackageStartupMessages({
library(dplyr)
library(lme4)
library(ggplot2)
library(slider)
library(scales)
library(plotly)
library(lubridate)
library(ggpattern)
library(cowplot)
})
# Note: these paths assume the CSV files sit in the project root. here() resolves them
# relative to the project directory, so adjust the file names if your data lives elsewhere.
schedule <- read_csv(here("schedule.csv"))
draft_schedule <- read_csv(here("schedule_24_partial.csv"))
locations <- read_csv(here("locations.csv"))
game_data <- read_csv(here("team_game_data.csv"))
Here I isolate OKC’s 80-game draft schedule and tag games that are the 4th within the past 6 nights (overlapping windows allowed). The printed result gives the count, and a small table lists the flagged dates.
# keep OKC only from the 2024-25 season
okc_1 <- draft_schedule %>%
filter(team == "OKC") %>%
mutate(gamedate = as_date(gamedate)) %>%
arrange(gamedate)
# Set a window of 6 days
okc_1 <- okc_1 %>%
mutate(
games_past_6_nights = slide_index_dbl(
.x = gamedate,
.i = gamedate,
.f = ~ length(.x),
.before = days(5),
.complete = FALSE
),
is_4th_in_6_nights = games_past_6_nights == 4 # See if the game is the 4th one
)
# Result
count_4in6 <- sum(okc_1$is_4th_in_6_nights, na.rm = TRUE)
dates_4th <- okc_1 %>%
filter(is_4th_in_6_nights) %>%
select(gamedate, opponent, home, win)
count_4in6
## [1] 26
OKC’s draft schedule contains 26 4-in-6 stretches.
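As a quick sanity check on the windowing logic, here is a toy example (the dates are made up) showing how `slide_index_dbl` counts games in overlapping 6-night windows:

```r
library(slider)
library(lubridate)

# Toy schedule: four games in six days (made-up dates for illustration)
d <- as_date(c("2024-01-01", "2024-01-03", "2024-01-04", "2024-01-06"))

# For each game, count games on that date and the 5 days before it
n_in_6 <- slide_index_dbl(d, d, ~ length(.x), .before = days(5))
n_in_6
# [1] 1 2 3 4  -> the Jan 6 game is the 4th within 6 nights (Jan 1-6)
```

Because the windows overlap, a long stretch of games can produce several consecutive 4-in-6 flags, which is why the counts above are totals of flagged games, not distinct stretches.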
sch <- schedule %>%
mutate(gamedate = as_date(gamedate)) %>%
arrange(team, season, gamedate)
# Set the 6-day window of team x season
sch_tagged <- sch %>%
group_by(team, season) %>%
mutate(
games_past_6_nights = slide_index_dbl(
.x = gamedate,
.i = gamedate,
.f = ~ length(.x),
.before = days(5),
.complete = FALSE
),
is_4th_in_6 = games_past_6_nights == 4
) %>%
ungroup()
# Sum up
agg_team_season <- sch_tagged %>%
group_by(team, season) %>%
summarise(
games_played = n(),
four_in_six = sum(is_4th_in_6, na.rm = TRUE),
.groups = "drop"
) %>%
mutate(
# Unify to per 82 games
four_in_six_per82 = four_in_six * (82 / games_played)
)
overall_avg_per82 <- mean(agg_team_season$four_in_six_per82, na.rm = TRUE)
overall_avg_per82
## [1] 25.09998
Teams average about 25.1 4-in-6 stretches per 82 games.
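The per-82 adjustment used throughout is simply count × 82 / games played; a minimal example (the numbers are hypothetical) for a shortened season:

```r
# Scale a raw count from a shortened season to an 82-game equivalent
per82 <- function(count, games_played) count * 82 / games_played

# Hypothetical team: 22 4-in-6 flags across a 72-game season
per82(22, 72)
# [1] 25.05556  -> right around the league average computed above
```

This keeps seasons of different lengths (e.g., the shortened 2019–20 and 2020–21 seasons) comparable on the same scale.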
team_avg <- agg_team_season %>%
group_by(team) %>%
summarise(
avg_4in6_per82 = mean(four_in_six_per82, na.rm = TRUE),
.groups = "drop"
) %>%
arrange(desc(avg_4in6_per82))
top_team <- team_avg %>% slice(1)
bottom_team <- team_avg %>% slice(n())
paste(top_team$team, round(top_team$avg_4in6_per82, 1))
## [1] "CHA 28.1"
paste(bottom_team$team, round(bottom_team$avg_4in6_per82, 1))
## [1] "NYK 22.2"
Most 4-in-6 stretches on average: CHA (28.1). Fewest 4-in-6 stretches on average: NYK (22.2).
Analyzing the gap between the most- and least-affected teams from Q3
mean_val <- mean(team_avg$avg_4in6_per82, na.rm = TRUE)
std_val <- sd(team_avg$avg_4in6_per82, na.rm = TRUE)
# Get CHA and NYK values
cha_val <- team_avg %>% filter(team == "CHA") %>% pull(avg_4in6_per82)
nyk_val <- team_avg %>% filter(team == "NYK") %>% pull(avg_4in6_per82)
# Calculate Z-scores
cha_z <- (cha_val - mean_val) / std_val
nyk_z <- (nyk_val - mean_val) / std_val
# Output results
list(
overall_avg_per82 = mean_val,
std_val = std_val,
cha_z = cha_z,
nyk_z = nyk_z
)
## $overall_avg_per82
## [1] 25.09998
##
## $std_val
## [1] 1.564988
##
## $cha_z
## [1] 1.922834
##
## $nyk_z
## [1] -1.861908
The gap is noticeable but not shocking. The difference between Charlotte (28.1) and New York (22.2) corresponds to about two standard deviations from the league average (25.1, SD = 1.56). Charlotte’s z-score is +1.92 and New York’s is –1.86, meaning both are near the threshold of what would usually be considered unusual, but not extreme outliers. Beyond random variation, schedule design factors may help explain the gap. Travel distance, divisional matchups, and arena availability all influence how games are clustered. For instance, New York’s dense geographic location reduces travel and allows for a more evenly spread schedule, while Charlotte, as a smaller-market team, may face more compressed stretches such as back-to-backs or 4-in-6 clusters.
Analyzing BKN’s defensive eFG% in the 2023–24 season, and specifically situations where their opponent was on the second night of a back-to-back
schedule <- schedule %>% mutate(gamedate = as_date(gamedate))
game_data <- game_data %>% mutate(gamedate = as_date(gamedate))
# Mark back to back
schedule_b2b <- schedule %>%
arrange(team, gamedate) %>%
group_by(team) %>%
mutate(prev_game = lag(gamedate),
days_rest = as.numeric(gamedate - prev_game),
b2b_flag = ifelse(days_rest == 1, "second", "other")) %>%
ungroup()
game_data_b2b <- game_data %>%
left_join(
schedule_b2b %>% select(gamedate, team, opp_b2b = b2b_flag),
by = c("gamedate" = "gamedate", "off_team" = "team")
)
# Filter out BKN as def
bkn_def <- game_data_b2b %>%
filter(season == 2023, def_team == "BKN")
# Calculate overall eFG%
bkn_def_overall <- bkn_def %>%
summarise(
total_fgm = sum(fg2made + fg3made, na.rm = TRUE),
total_fg3m = sum(fg3made, na.rm = TRUE),
total_fga = sum(fg2attempted + fg3attempted, na.rm = TRUE),
def_eFG = (total_fgm + 0.5 * total_fg3m) / total_fga
)
# eFG% when opp is back to back
bkn_def_b2b <- bkn_def %>%
filter(opp_b2b == "second") %>%
summarise(
total_fgm_b2b = sum(fg2made + fg3made, na.rm = TRUE),
total_fg3m_b2b = sum(fg3made, na.rm = TRUE),
total_fga_b2b = sum(fg2attempted + fg3attempted, na.rm = TRUE),
def_eFG_b2b = (total_fgm_b2b + 0.5 * total_fg3m_b2b) / total_fga_b2b
)
# result
result <- tibble(
team = "BKN",
season = "2023-24",
def_eFG_overall = bkn_def_overall$def_eFG,
def_eFG_opp_B2B = bkn_def_b2b$def_eFG_b2b
)
result
## # A tibble: 1 × 4
## team season def_eFG_overall def_eFG_opp_B2B
## <chr> <chr> <dbl> <dbl>
## 1 BKN 2023-24 0.543 0.535
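The numbers above use the standard effective field-goal formula, eFG% = (FGM + 0.5 × 3PM) / FGA, which credits the extra value of made threes; a quick check with made-up shot totals:

```r
# Effective FG%: made threes count an extra half-make
efg <- function(fgm, fg3m, fga) (fgm + 0.5 * fg3m) / fga

# Made-up box line: 40 makes (12 of them threes) on 85 attempts
efg(40, 12, 85)
# [1] 0.5411765
```

So the roughly 0.8-point drop (0.543 to 0.535) against rested-vs-tired opponents is small but in the expected direction.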
This section highlights two simple league-level trends: fewer back-to-backs over time and less average travel per 82 games, consistent with more geographically efficient clustering.
Trend 1: Fewer back-to-back games
The average number of back-to-back games per team has declined steadily since 2014, dropping from over 19 per team in 2014 to around 13–14 in recent seasons. This reflects an intentional scheduling adjustment by the league, likely intended to reduce player fatigue and injury risk and give teams more rest between games. The distribution of rest days confirms this change: compared to the 2014–2016 era, the 2021–2023 era features fewer one-day-rest games and relatively more games with at least two rest days.
Trend 2: Travel optimization
Travel shows two patterns: 1) Average team travel distance (per 82 games) has decreased, indicating that schedules now cluster road trips more efficiently and cut unnecessary back-and-forth flights. 2) Cross-timezone travel has remained broadly stable over time (around 52–53 games per season), but the drop in total travel distance suggests the NBA has trimmed long-haul trips by grouping games geographically.
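The travel figures computed below are great-circle (haversine) distances between arena coordinates; for a sense of scale, a minimal example with rough city coordinates (the lon/lat values are approximate, for illustration only):

```r
library(geosphere)

# Approximate (lon, lat) pairs -- rough values, not exact arena coordinates
new_york    <- c(-73.99, 40.75)
los_angeles <- c(-118.27, 34.04)

# distHaversine returns meters; convert to km
distHaversine(new_york, los_angeles) / 1000
# roughly 3,900-4,000 km: a single coast-to-coast flight outweighs
# several short trips inside a geographically dense division
```

This is why grouping West Coast road games into one trip moves the per-82 travel average far more than tweaking intra-division scheduling.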
2020 Spike:
Both trends show a noticeable spike in the 2019–20 season, likely attributable to the COVID-19 pandemic. The shortened, compressed schedule forced more consecutive games and unusual travel patterns, creating a temporary disruption. This anomaly is best understood as a one-off effect rather than a reversal of the long-term trend.
library(ggplot2)
# Back to back trend over time
sch <- schedule %>%
mutate(gamedate = as_date(gamedate)) %>%
arrange(team, season, gamedate)
games_played <- sch %>%
count(team, season, name = "games_played")
sch_rest <- sch %>%
group_by(team, season) %>%
mutate(prev_date = lag(gamedate),
days_rest = as.numeric(gamedate - prev_date)) %>%
ungroup()
# B2B data for every team every season
team_season_b2b_raw <- sch_rest %>%
group_by(team, season) %>%
summarise(b2b_count = sum(days_rest == 1, na.rm = TRUE),
.groups = "drop")
# Adjust to per-82
team_season_b2b <- team_season_b2b_raw %>%
left_join(games_played, by = c("team","season")) %>%
mutate(b2b_per82 = 82 * b2b_count / games_played)
# League-level averages
league_b2b_trend <- team_season_b2b %>%
group_by(season) %>%
summarise(
avg_b2b_per82 = mean(b2b_per82, na.rm = TRUE),
med_b2b_per82 = median(b2b_per82, na.rm = TRUE),
.groups = "drop"
)
ggplot(league_b2b_trend, aes(x = season, y = avg_b2b_per82)) +
geom_line(linewidth = 1.2, color = "blue") +
geom_point(size = 2, color = "orangered") +
labs(
title = "Average Back-to-Backs per Team by Season (per 82 games)",
x = "Season (start year)",
y = "Avg B2B per-82"
) +
theme_minimal()
# Rest-day distribution plots (no per-82 needed)
rest_era <- sch_rest %>%
filter(!is.na(days_rest)) %>%
mutate(
era = case_when(
season %in% 2014:2016 ~ "2014–2016 (early)",
season %in% 2021:2023 ~ "2021–2023 (recent)",
TRUE ~ NA_character_
)
) %>%
filter(!is.na(era), days_rest >= 0, days_rest <= 5)
ggplot(rest_era, aes(x = factor(days_rest), fill = era)) +
geom_bar(position = "dodge") +
scale_fill_manual(
values = c("2014–2016 (early)" = "blue",
"2021–2023 (recent)" = "orangered")
) +
labs(
title = "Rest-Day Distribution by Era",
x = "Days of Rest before a game",
y = "Number of games"
) +
theme_minimal()
ggplot(rest_era, aes(x = factor(days_rest), fill = era)) +
geom_bar(position = "fill") +
scale_y_continuous(labels = scales::percent) +
scale_fill_manual(
values = c("2014–2016 (early)" = "blue",
"2021–2023 (recent)" = "orangered")
) +
labs(
title = "Rest-Day Distribution: Early vs Recent Seasons",
x = "Days of Rest before a game",
y = "Proportion of games"
) +
theme_minimal()
# Travel distance trend
library(geosphere)
loc_team <- locations %>%
rename(team_lat = latitude,
team_lon = longitude,
home_tz = timezone)
loc_opp <- locations %>%
rename(opponent = team,
opp_lat = latitude,
opp_lon = longitude,
opp_tz = timezone)
schedule_loc <- sch %>%
left_join(loc_team, by = "team") %>%
left_join(loc_opp, by = "opponent") %>%
mutate(
home_flag = ifelse(home %in% c(TRUE, 1, "1", "H"), 1, 0),
game_lat = if_else(home_flag == 1, team_lat, opp_lat),
game_lon = if_else(home_flag == 1, team_lon, opp_lon)
)
# Calculate travel distance between consecutive games
travel_data <- schedule_loc %>%
arrange(team, season, gamedate) %>%
group_by(team, season) %>%
mutate(
prev_lat = lag(game_lat),
prev_lon = lag(game_lon),
travel_km = distHaversine(
cbind(prev_lon, prev_lat),
cbind(game_lon, game_lat)
) / 1000
) %>%
ungroup()
# Per-team totals and per-82 adjustment
team_travel <- travel_data %>%
group_by(team, season) %>%
summarise(total_travel_km = sum(travel_km, na.rm = TRUE), .groups = "drop") %>%
left_join(games_played, by = c("team","season")) %>%
mutate(travel_km_per82 = 82 * total_travel_km / games_played)
# League average travel per season (per-82 adjusted)
season_travel <- team_travel %>%
group_by(season) %>%
summarise(avg_travel_km_per82 = mean(travel_km_per82, na.rm = TRUE), .groups = "drop")
ggplot(season_travel, aes(x = season, y = avg_travel_km_per82)) +
geom_line(color = "orangered", linewidth = 1.2) +
geom_point(color = "blue", size = 2) +
labs(
title = "Average Team Travel Distance by Season (per 82 games)",
x = "Season",
y = "Average Travel Distance per-82 (km)"
) +
theme_minimal()
# time zone
map_to_abbrev <- function(tz) {
case_when(
tz == "Eastern" ~ "ET",
tz == "Central" ~ "CT",
tz == "Mountain" ~ "MT",
tz == "Pacific" ~ "PT",
TRUE ~ "OTHER"
)
}
sch_tz <- schedule_loc %>%
mutate(home_tz_abbr = map_to_abbrev(home_tz),
opp_tz_abbr = map_to_abbrev(opp_tz),
cross_tz = as.integer(home_tz_abbr != opp_tz_abbr))
# Cross time zone game percentage for each team each season
team_cross_tz_raw <- sch_tz %>%
group_by(team, season) %>%
summarise(cross_tz_games = sum(cross_tz, na.rm = TRUE),
total_games = n(),
.groups = "drop")
# Adjust to per-82
team_cross_tz <- team_cross_tz_raw %>%
left_join(games_played, by = c("team","season")) %>%
mutate(
cross_tz_per82 = 82 * cross_tz_games / games_played,
share_cross_tz = cross_tz_games / total_games # share is already normalized
)
# League average
league_cross_tz <- team_cross_tz %>%
group_by(season) %>%
summarise(
avg_cross_tz_per82 = mean(cross_tz_per82, na.rm = TRUE),
avg_share_cross_tz = mean(share_cross_tz, na.rm = TRUE),
.groups = "drop"
)
ggplot(league_cross_tz, aes(season, avg_cross_tz_per82)) +
geom_line(linewidth = 1.2, color = "blue") +
geom_point(size = 2, color = "orangered") +
labs(title = "Average Cross-Timezone Games per Team by Season (per 82 games)",
x = "Season (start year)", y = "Cross-timezone games per-82") +
theme_minimal()
Season-View Visualization Tool
I built a calendar-style visualization that summarizes a team’s season: each day tile encodes travel since the previous game; overlays mark back-to-backs (B), 4-in-6 clusters (🔺), and home games (🏠). A second panel shows weekly density.
library(slider)
library(plotly)
library(scales)
library(lubridate)
library(ggpattern)
library(cowplot)
# Function for timezone and team
tz_abbrev <- function(x) {
case_when(
x %in% c("Eastern","ET") ~ "ET",
x %in% c("Central","CT") ~ "CT",
x %in% c("Mountain","MT") ~ "MT",
x %in% c("Pacific","PT") ~ "PT",
TRUE ~ "OTHER"
)
}
abbr_team <- function(x) x # identity placeholder: team codes are already abbreviations
# Preprocess
# schedule_df: team, opponent, home, gamedate, season
# locations_df: team, latitude, longitude, timezone
prepare_schedule <- function(schedule_df, locations_df) {
sch <- schedule_df %>%
mutate(
gamedate = as_date(gamedate),
season = as.integer(season),
home = if_else(home %in% c(TRUE, 1, "1", "H", "HOME"), 1L, 0L)
)
loc_team <- locations_df %>%
rename(team_lat = latitude, team_lon = longitude, home_tz = timezone)
loc_opp <- locations_df %>%
rename(opponent = team, opp_lat = latitude, opp_lon = longitude, opp_tz = timezone)
sch_loc <- sch %>%
left_join(loc_team, by = "team") %>%
left_join(loc_opp, by = "opponent") %>%
mutate(
# game site and time zone
site_lat = if_else(home == 1L, team_lat, opp_lat),
site_lon = if_else(home == 1L, team_lon, opp_lon),
site_tz = if_else(home == 1L, home_tz, opp_tz),
site_tz = tz_abbrev(site_tz)
) %>%
arrange(team, season, gamedate) %>%
group_by(team, season) %>%
mutate(
# info of the previous game
prev_date = lag(gamedate),
prev_lat = lag(site_lat),
prev_lon = lag(site_lon),
prev_tz = lag(site_tz),
# days of rest and back to back
days_rest = as.numeric(gamedate - prev_date),
is_b2b = days_rest == 1,
# 4 in 6
games_past_6 =
slide_index_int(
.x = rep(1L, n()),
.i = gamedate,
.f = ~ length(.x),
.before = 5,
.complete = FALSE
),
is_4in6 = games_past_6 == 4L,
# if cross timezone for two consecutive games
cross_tz_prev = !is.na(prev_tz) & !is.na(site_tz) & prev_tz != site_tz,
# travel distance
travel_km = if_else(
is.na(prev_lat) | is.na(prev_lon) | is.na(site_lat) | is.na(site_lon),
0,
as.numeric(distHaversine(cbind(prev_lon, prev_lat), cbind(site_lon, site_lat)) / 1000)
),
# text
opp_lab = abbr_team(opponent),
home_away = if_else(home == 1L, "Home", "Away"),
tooltip = paste0(
format(gamedate, "%Y-%m-%d"), " (", home_away, ") vs ", opp_lab,
"<br>Rest days: ", ifelse(is.na(days_rest), "-", days_rest),
ifelse(is_b2b, " | B2B", ""),
ifelse(is_4in6, " | 4-in-6", ""),
"<br>Travel (prev→this): ", comma(travel_km), " km",
ifelse(cross_tz_prev, " | Cross TZ", ""),
"<br>Site TZ: ", site_tz
)
) %>%
ungroup()
# weekly game density
weekly <- sch_loc %>%
mutate(week = floor_date(gamedate, unit = "week", week_start = 1)) %>%
count(team, season, week, name = "games_in_week")
list(games = sch_loc, weekly = weekly)
}
# interactive plot function
# team_abbr: e.g. "OKC"; season_int: season start year, e.g. 2024
plot_team_schedule <- function(prepped, team_abbr, season_int,
dense_threshold = 4,
travel_scale_range = c(3, 12),
initial_weeks_shown = 25) {
games <- prepped$games %>% dplyr::filter(team == team_abbr, season == season_int)
weekly <- prepped$weekly %>% dplyr::filter(team == team_abbr, season == season_int)
validate_rows <- nrow(games)
if (validate_rows == 0) stop("No rows for the specified team/season. Check inputs.")
season_start0 <- lubridate::floor_date(min(games$gamedate, na.rm = TRUE),
unit = "week", week_start = 1)
games <- games %>%
dplyr::mutate(
week_start = lubridate::floor_date(gamedate, unit = "week", week_start = 1),
week_idx = as.integer((week_start - .env$season_start0) / 7) + 1L,
weekday_raw = lubridate::wday(gamedate, label = TRUE, abbr = TRUE, week_start = 1),
weekday = factor(weekday_raw, levels = levels(weekday_raw), ordered = FALSE)
) %>%
dplyr::select(-weekday_raw)
weekly <- weekly %>%
dplyr::mutate(
week_idx = as.integer((week - .env$season_start0) / 7) + 1L,
tt = paste0(
format(week, "%Y-%m-%d"),
"<br>Games this week: ", games_in_week,
ifelse(games_in_week >= dense_threshold, " (dense)", "")
)
)
# cluster to day
games_day <- games %>%
group_by(week_idx, weekday) %>%
summarise(
n_games = n(),
n_home = sum(home == 1L, na.rm = TRUE),
n_away = sum(home == 0L, na.rm = TRUE),
any_b2b = any(is_b2b, na.rm = TRUE),
any_4in6 = any(is_4in6, na.rm = TRUE),
max_trvl = suppressWarnings(max(travel_km, na.rm = TRUE)),
# tooltip
lines = paste0(
format(gamedate, "%Y-%m-%d"),
" — ", ifelse(home == 1L, "Home vs ", "Away @ "), opponent,
ifelse(is_b2b, " | B2B", ""),
ifelse(is_4in6, " | 4-in-6", ""),
" | Travel: ", comma(round(travel_km)), " km"
) %>% paste(collapse = "<br>"),
.groups = "drop"
)
# 7×N
max_week <- max(games$week_idx, na.rm = TRUE)
weekday_lvls <- levels(games$weekday)
grid <- tidyr::expand_grid(
week_idx = 1:max_week,
weekday = factor(weekday_lvls, levels = weekday_lvls, ordered = FALSE)
) %>%
left_join(
games_day %>%
mutate(weekday = factor(weekday, levels = weekday_lvls, ordered = FALSE)),
by = c("week_idx", "weekday")
) %>%
mutate(
n_games = tidyr::replace_na(n_games, 0L),
n_home = tidyr::replace_na(n_home, 0L),
n_away = tidyr::replace_na(n_away, 0L),
any_b2b = tidyr::replace_na(any_b2b, FALSE),
any_4in6 = tidyr::replace_na(any_4in6, FALSE),
max_trvl = ifelse(is.finite(max_trvl), max_trvl, NA_real_),
# for coloring
day_type = dplyr::case_when(
n_games == 0L ~ "None",
n_home > 0L & n_away == 0L ~ "Home",
n_away > 0L & n_home == 0L ~ "Away",
TRUE ~ "Mixed"
),
# tooltip
tt = ifelse(
n_games == 0L,
paste0("Week ", week_idx, " ", weekday, ": No game"),
paste0("Week ", week_idx, " ", weekday, "<br>", lines)
)
)
# Top panel: season calendar grid
# fill = travel km; "B" marks back-to-backs; 🔺 marks 4-in-6; 🏠 marks home games
p_calendar <- ggplot(grid, aes(x = weekday, y = week_idx)) +
# show when multiple games
geom_text(aes(label = ifelse(n_games > 1L, as.character(n_games), "")),
size = 3.2, fontface = "bold") +
# B2B start
geom_text(
data = grid %>% dplyr::filter(any_b2b & n_games > 0L),
aes(x = weekday, y = week_idx, label = "B"),
size = 3,
position = position_nudge(x = -0.09),
show.legend = TRUE
) +
# 4-in-6 triangle
geom_text(
data = grid %>% dplyr::filter(any_4in6 & n_games > 0L),
aes(x = weekday, y = week_idx, label = "🔺"),
size = 3,
position = position_nudge(x = +0.09),
inherit.aes = FALSE
) +
geom_tile(aes(fill = max_trvl, text = tt), color = "grey90", width = 0.95, height = 0.95) +
scale_fill_gradient(
name = "Travel (km)",
low = "#E6F2FF", high = "dodgerblue", na.value = "#f7f7f7"
) +
geom_text(
data = grid %>% dplyr::filter(n_home > 0L & n_games > 0L),
aes(x = weekday, y = week_idx, label = "🏠"),
position = position_nudge(x = -0.27, y = 0),
size = 2.5,
inherit.aes = FALSE
) +
scale_x_discrete(limits = weekday_lvls, drop = FALSE) +
scale_y_reverse(breaks = pretty_breaks(), expand = expansion(add = 0.2)) +
labs(
title = paste0(team_abbr, " — Season ", season_int, " (Mon–Sun × Week grid)"),
x = "Weekday", y = "Week number (from season start)"
) +
theme_minimal(base_size = 12) +
theme(
panel.grid = element_blank(),
plot.title = element_text(face = "bold")
)
# Bottom panel: weekly game density
p_weekly <- ggplot(weekly, aes(x = week_idx, y = games_in_week)) +
geom_col(fill = "blue") +
geom_hline(yintercept = dense_threshold, linetype = 2, color = "orangered") +
geom_col(
data = weekly %>% dplyr::filter(games_in_week >= dense_threshold),
aes(x = week_idx, y = games_in_week),
fill = "orangered"
) +
geom_text(
data = weekly %>% dplyr::filter(games_in_week >= dense_threshold),
aes(label = games_in_week), vjust = -0.25, color = "black", fontface = "bold"
) +
scale_x_continuous(breaks = pretty_breaks()) +
scale_y_continuous(breaks = 0:7, limits = c(0, max(weekly$games_in_week, 5) + 1)) +
labs(title = paste0(team_abbr, " — Weekly game density"),
x = "Week number", y = "Games / week") +
theme_minimal(base_size = 12) +
theme(panel.grid.minor = element_blank(), plot.title = element_text(size = 11))
# interactive
g1 <- ggplotly(p_calendar, tooltip = "text") %>%
layout(
margin = list(t = 60, b = 0),
yaxis = list(autorange = "reversed", fixedrange = FALSE),
xaxis = list(fixedrange = FALSE)
)
g2 <- ggplotly(p_weekly, tooltip = "text") %>%
layout(margin = list(t = 40), xaxis = list(fixedrange = FALSE), yaxis = list(fixedrange = FALSE))
subplot(g1, g2, nrows = 2, heights = c(0.72, 0.28), shareX = FALSE, titleY = TRUE) %>%
layout(
legend = list(orientation = "h", y = -0.1),
hoverlabel = list(bgcolor = "white")
)
}
prepped_2425 <- prepare_schedule(draft_schedule, locations)
p_okc <- plot_team_schedule(prepped_2425, team_abbr = "OKC", season_int = 2024,
dense_threshold = 4, travel_scale_range = c(3, 12), initial_weeks_shown = 25)
p_den <- plot_team_schedule(prepped_2425, team_abbr = "DEN", season_int = 2024,
dense_threshold = 4, travel_scale_range = c(3, 12), initial_weeks_shown = 25)
p_okc
p_den
I created a visualization tool that presents a team’s entire season schedule in an intuitive way. The calendar grid (Mon to Sun × Week) highlights each day with a background color based on travel distance (light blue for short trips, darker blue for longer ones) and overlays key schedule markers: “B” for back-to-back games, the red triangle emoji for 4-in-6 stretches, and the house emoji for home games. This helps viewers immediately spot periods of heavy travel or dense game clusters, while hover tooltips provide detailed information such as date, opponent, and travel distance.
The weekly bar chart summarizes the number of games per week, with a threshold line at 4 games per week to quickly flag high-density weeks. Together, the two panels offer both a horizontal view of weekly game load and a vertical view of season-long travel and schedule distribution. This design may help teams and analysts identify high-risk stretches of the season, supporting better preparation, workload management, and schedule evaluation.
Below is a compact mixed-effects logistic model as a first pass at quantifying schedule-related effects while controlling for opponent-strength proxies.
library(readr)
# Safe division function to avoid dividing by zero or NA
safe_div <- function(num, den) ifelse(is.na(den) | den == 0, NA_real_, num / den)
# -------------------------
# (1) Row-level (per game) metrics
# -------------------------
game_data_metrics <- game_data %>%
mutate(
# Offensive metrics (for off_team)
PPA = safe_div(shotattemptpoints, shotattempts), # ~2*TS%
AST_pct = safe_div(assists, fgmade),
OREB_pct = safe_div(reboffensive, reboundchance),
DREB_pct = safe_div(rebdefensive, reboundchance),
TOV_pct = safe_div(turnovers, shotattempts + turnovers),
ORTG = safe_div(points, possessions / 100),
# Defensive metrics (for def_team; derived from opponent's offense)
STL_rate_def = safe_div(stealsagainst, possessions),
STL_per100_def = 100 * STL_rate_def,
BLK_pct_def = safe_div(blocksagainst, fg2attempted),
DRTG_def = safe_div(points, possessions / 100)
)
# -------------------------
# (2) Season-to-date (pre-game) cumulative metrics
# -------------------------
# (2a) Offensive S2D (by off_team)
off_cum_before <- game_data %>%
arrange(season, off_team, gamedate, nbagameid) %>%
group_by(season, team = off_team) %>%
mutate(
c_poss_off = cumsum(coalesce(possessions, 0)),
c_pts_off = cumsum(coalesce(points, 0)),
c_shotpts = cumsum(coalesce(shotattemptpoints, 0)),
c_shota = cumsum(coalesce(shotattempts, 0)),
c_ast = cumsum(coalesce(assists, 0)),
c_fgmade = cumsum(coalesce(fgmade, 0)),
c_oreb = cumsum(coalesce(reboffensive, 0)),
c_rebchance_off = cumsum(coalesce(reboundchance, 0)),
c_tov = cumsum(coalesce(turnovers, 0))
) %>%
mutate(
poss_off_before = lag(c_poss_off),
pts_off_before = lag(c_pts_off),
shotpts_before = lag(c_shotpts),
shota_before = lag(c_shota),
ast_before = lag(c_ast),
fgmade_before = lag(c_fgmade),
oreb_before = lag(c_oreb),
rebchance_off_before = lag(c_rebchance_off),
tov_before = lag(c_tov)
) %>%
mutate(
off_ORTG_before = safe_div(pts_off_before, poss_off_before / 100),
off_PPA_before = safe_div(shotpts_before, shota_before),
off_AST_pct_before = safe_div(ast_before, fgmade_before),
off_OREB_pct_before = safe_div(oreb_before, rebchance_off_before),
off_TOV_pct_before = safe_div(tov_before, shota_before + tov_before)
) %>%
ungroup() %>%
select(season, nbagameid, team,
off_ORTG_before, off_PPA_before, off_AST_pct_before,
off_OREB_pct_before, off_TOV_pct_before)
# (2b) Defensive S2D (by def_team)
def_cum_before <- game_data %>%
arrange(season, def_team, gamedate, nbagameid) %>%
group_by(season, team = def_team) %>%
mutate(
c_poss_def = cumsum(coalesce(possessions, 0)), # defensive poss = opponent poss
c_pts_allowed = cumsum(coalesce(points, 0)),
c_steals_for = cumsum(coalesce(stealsagainst, 0)),
c_blocks_for = cumsum(coalesce(blocksagainst, 0)),
c_opp_fg2a = cumsum(coalesce(fg2attempted, 0))
) %>%
mutate(
poss_def_before = lag(c_poss_def),
pts_allowed_before = lag(c_pts_allowed),
steals_before = lag(c_steals_for),
blocks_before = lag(c_blocks_for),
opp_fg2a_before = lag(c_opp_fg2a)
) %>%
mutate(
def_DRTG_before = safe_div(pts_allowed_before, poss_def_before / 100),
def_STL_rate_before = safe_div(steals_before, poss_def_before),
def_STL_per100_before = 100 * def_STL_rate_before,
def_BLK_pct_before = safe_div(blocks_before, opp_fg2a_before)
) %>%
ungroup() %>%
select(season, nbagameid, team,
def_DRTG_before, def_STL_rate_before, def_STL_per100_before, def_BLK_pct_before)
# (2c) Merge S2D back to each row
game_with_s2d <- game_data_metrics %>%
# focal OFF S2D
left_join(
off_cum_before,
by = c("season", "nbagameid", "off_team" = "team")
) %>%
# opponent DEF S2D
left_join(
def_cum_before %>%
rename(
opp_def_DRTG_before = def_DRTG_before,
opp_def_STL_rate_before = def_STL_rate_before,
opp_def_STL_per100_before = def_STL_per100_before,
opp_def_BLK_pct_before = def_BLK_pct_before
),
by = c("season", "nbagameid", "def_team" = "team")
) %>%
# opponent OFF S2D
left_join(
off_cum_before %>%
rename(
opp_off_team = team,
opp_off_ORTG_before = off_ORTG_before,
opp_off_PPA_before = off_PPA_before,
opp_off_AST_pct_before = off_AST_pct_before,
opp_off_OREB_pct_before = off_OREB_pct_before,
opp_off_TOV_pct_before = off_TOV_pct_before
),
by = c("season", "nbagameid", "def_team" = "opp_off_team")
) %>%
# focal pre-game NET rating (OFF S2D - opponent DEF S2D)
mutate(
focal_NET_RTG_before = off_ORTG_before - opp_def_DRTG_before
)
# -------------------------
# (3) Aggregate team × season metrics
# -------------------------
# Offensive aggregates
off_agg <- game_data %>%
group_by(season, team = off_team) %>%
summarise(
poss_off = sum(possessions, na.rm = TRUE),
pts_off = sum(points, na.rm = TRUE),
shotpts_sum = sum(shotattemptpoints, na.rm = TRUE),
shota_sum = sum(shotattempts, na.rm = TRUE),
ast_sum = sum(assists, na.rm = TRUE),
fgmade_sum = sum(fgmade, na.rm = TRUE),
oreb_sum = sum(reboffensive, na.rm = TRUE),
rebchance_off = sum(reboundchance, na.rm = TRUE),
tov_sum = sum(turnovers, na.rm = TRUE),
.groups = "drop"
) %>%
mutate(
ORTG = safe_div(pts_off, poss_off / 100),
PPA = safe_div(shotpts_sum, shota_sum),
AST_pct = safe_div(ast_sum, fgmade_sum),
OREB_pct = safe_div(oreb_sum, rebchance_off),
TOV_pct = safe_div(tov_sum, shota_sum + tov_sum)
) %>%
select(season, team, ORTG, PPA, AST_pct, OREB_pct, TOV_pct)
# Defensive aggregates
def_agg <- game_data %>%
group_by(season, team = def_team) %>%
summarise(
poss_def = sum(possessions, na.rm = TRUE),
pts_allowed = sum(points, na.rm = TRUE),
steals_for = sum(stealsagainst, na.rm = TRUE),
blocks_for = sum(blocksagainst, na.rm = TRUE),
opp_fg2a_sum = sum(fg2attempted, na.rm = TRUE),
.groups = "drop"
) %>%
mutate(
DRTG = safe_div(pts_allowed, poss_def / 100),
STL_rate = safe_div(steals_for, poss_def),
STL_per100 = 100 * STL_rate,
BLK_pct = safe_div(blocks_for, opp_fg2a_sum)
) %>%
select(season, team, DRTG, STL_rate, STL_per100, BLK_pct)
# Combine and compute Net Rating
team_season_metrics <- off_agg %>%
full_join(def_agg, by = c("season", "team")) %>%
mutate(NET_RTG = ORTG - DRTG) %>%
arrange(season, team)
# Schedule related data process
# Merge all four schedule-related tables
schedule_model <- sch_rest %>%
left_join(
travel_data %>%
select(season, gamedate, team, opponent, home, win, travel_km),
by = c("season","gamedate","team","opponent","home","win")
) %>%
left_join(
schedule_b2b %>%
select(season, gamedate, team, opponent, home, win, b2b_flag),
by = c("season","gamedate","team","opponent","home","win")
) %>%
left_join(
sch_tz %>%
select(season, gamedate, team, opponent, home, win, cross_tz),
by = c("season","gamedate","team","opponent","home","win")
)
schedule_model <- schedule_model %>%
mutate(gamedate = as.Date(gamedate))
schedule_model <- schedule_model %>%
arrange(team, season, gamedate) %>%
group_by(team, season) %>%
mutate(
games_in_4 = slide_int(
gamedate,
~ sum(.x >= .x[length(.x)] - 3 & .x <= .x[length(.x)]),
.before = Inf,
.complete = TRUE
),
games_in_6 = slide_int(
gamedate,
~ sum(.x >= .x[length(.x)] - 5 & .x <= .x[length(.x)]),
.before = Inf,
.complete = TRUE
)
) %>%
ungroup()
# Merge
s2d_model <- game_with_s2d %>%
mutate(gamedate = as.Date(gamedate)) %>%
select(
season, gamedate, off_team, def_team,
# focal team's S2D (pre-game)
off_ORTG_before, off_PPA_before, off_AST_pct_before,
off_OREB_pct_before, off_TOV_pct_before,
# opponent defense S2D (already renamed in previous step)
opp_def_DRTG_before, opp_def_STL_rate_before,
opp_def_STL_per100_before, opp_def_BLK_pct_before,
# opponent offense S2D
opp_off_ORTG_before, opp_off_PPA_before,
opp_off_AST_pct_before, opp_off_OREB_pct_before, opp_off_TOV_pct_before,
# optional composite
focal_NET_RTG_before
)
model_data <- schedule_model %>%
mutate(
gamedate = as.Date(gamedate),
time_zone_change = cross_tz,
b2b = ifelse(is.na(b2b_flag), 0L, b2b_flag)
) %>%
left_join(
s2d_model,
by = c("season", "gamedate", "team" = "off_team", "opponent" = "def_team")
)
anti <- schedule_model %>%
mutate(gamedate = as.Date(gamedate)) %>%
anti_join(
s2d_model, by = c("season", "gamedate", "team" = "off_team", "opponent" = "def_team")
)
cat("Unmatched rows after join:", nrow(anti), "\n")
## Unmatched rows after join: 0
# Model
model_data <- model_data %>%
filter(season >= 2019 & season <= 2023)
model_data <- model_data %>%
mutate(b2b = ifelse(b2b == "second", 1, 0))
library(lme4)
model_main <- glmer(
win ~ home + b2b + scale(days_rest) + scale(travel_km) + time_zone_change +
opp_off_ORTG_before + opp_def_DRTG_before +
factor(season) + (1 | team),
data = model_data,
family = binomial(link = "logit")
)
library(broom.mixed)
# Fixed effects table only
tidy(model_main, effects = "fixed", conf.int = TRUE) %>%
select(term, estimate, std.error, conf.low, conf.high, p.value)
## # A tibble: 12 × 6
## term estimate std.error conf.low conf.high p.value
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 0.102 0.771 -1.41 1.61 8.95e- 1
## 2 home 0.431 0.0413 0.350 0.513 1.69e-25
## 3 b2b -0.255 0.0526 -0.358 -0.152 1.24e- 6
## 4 scale(days_rest) 0.0114 0.0193 -0.0264 0.0492 5.53e- 1
## 5 scale(travel_km) -0.0396 0.0209 -0.0806 0.00140 5.84e- 2
## 6 time_zone_change -0.0295 0.0436 -0.115 0.0560 4.99e- 1
## 7 opp_off_ORTG_before -0.0812 0.00502 -0.0910 -0.0713 6.85e-59
## 8 opp_def_DRTG_before 0.0789 0.00528 0.0685 0.0892 1.58e-50
## 9 factor(season)2020 0.0110 0.0672 -0.121 0.143 8.70e- 1
## 10 factor(season)2021 0.00367 0.0631 -0.120 0.127 9.54e- 1
## 11 factor(season)2022 0.0116 0.0703 -0.126 0.149 8.69e- 1
## 12 factor(season)2023 0.0155 0.0757 -0.133 0.164 8.38e- 1
# Random effects variance
tidy(model_main, effects = "ran_pars")
## # A tibble: 1 × 4
## effect group term estimate
## <chr> <chr> <chr> <dbl>
## 1 ran_pars team sd__(Intercept) 0.412
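Coefficients on the logit scale are hard to read directly; exponentiating gives odds ratios, and `plogis()` maps linear predictors to probabilities. A quick base-R illustration using the home and b2b point estimates from the table above:

```r
# Odds ratios from the fitted logit coefficients
exp(0.431)   # home: roughly 1.54x the odds of winning
exp(-0.255)  # b2b: roughly 0.78x the odds (about a 22% odds reduction)

# Probability-scale effect for an otherwise 50/50 game
plogis(0 + 0.431) - plogis(0)   # home edge, about +10.6 points
plogis(0 - 0.255) - plogis(0)   # back-to-back penalty, about -6.3 points
```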
# Robustness: XGBoost classification
suppressPackageStartupMessages({
library(Matrix)
library(xgboost) # masks slice() from plotly and dplyr
library(pROC)    # masks cov, smooth, var from stats
})
x_formula <- ~
home + b2b + scale(days_rest) + scale(travel_km) + time_zone_change +
opp_off_ORTG_before + opp_def_DRTG_before +
opp_off_PPA_before + opp_off_AST_pct_before +
opp_off_OREB_pct_before + opp_off_TOV_pct_before +
opp_def_STL_rate_before + opp_def_BLK_pct_before +
factor(season) + factor(team) - 1
# Model matrix
X <- model.matrix(x_formula, data = model_data)
X[is.na(X)] <- 0
X <- Matrix(X, sparse = TRUE)
y <- model_data$win
# Train / test split (80/20)
n <- nrow(X)
set.seed(42) # fix the split so the diagnostics below are reproducible
idx_train <- sample.int(n, floor(0.8 * n))
idx_test <- setdiff(seq_len(n), idx_train)
dtrain <- xgb.DMatrix(X[idx_train, ], label = y[idx_train])
dtest <- xgb.DMatrix(X[idx_test, ], label = y[idx_test])
# Cross-validated search for best nrounds with early stopping
params <- list(
objective = "binary:logistic",
eval_metric = "logloss",
max_depth = 6,
eta = 0.08,
subsample = 0.8,
colsample_bytree = 0.8,
min_child_weight = 5,
lambda = 1
)
cv <- xgb.cv(
params = params,
data = dtrain,
nrounds = 2000,
nfold = 5,
stratified = TRUE,
early_stopping_rounds = 50,
verbose = 0
)
best_nrounds <- cv$best_iteration
cat("Best CV nrounds:", best_nrounds, "\n")
## Best CV nrounds: 32
# Train final model with best_nrounds
watchlist <- list(train = dtrain, test = dtest)
xgb_model <- xgb.train(
params = params,
data = dtrain,
nrounds = best_nrounds,
watchlist = watchlist,
verbose = 0
)
# Simple diagnostics on test set
pred_prob <- predict(xgb_model, dtest)
roc_obj <- pROC::roc(y[idx_test], pred_prob, quiet = TRUE)
auc_val <- pROC::auc(roc_obj)
logloss <- -mean(y[idx_test] * log(pmax(pred_prob, 1e-15)) +
(1 - y[idx_test]) * log(pmax(1 - pred_prob, 1e-15)))
acc <- mean( (pred_prob >= 0.5) == as.logical(y[idx_test]) )
cat(sprintf("Test AUC = %.3f | Logloss = %.4f | Accuracy = %.3f\n",
as.numeric(auc_val), logloss, acc))
## Test AUC = 0.567 | Logloss = 0.6849 | Accuracy = 0.544
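The logloss above clips predicted probabilities with `pmax()` so that `log(0)` never occurs. The same metric as a self-contained base-R function, with toy labels and predictions for illustration:

```r
logloss_fn <- function(y, p, eps = 1e-15) {
  p <- pmin(pmax(p, eps), 1 - eps)  # clip away from exact 0 and 1
  -mean(y * log(p) + (1 - y) * log(1 - p))
}

y_toy <- c(1, 0, 1, 0)
p_toy <- c(0.9, 0.2, 0.6, 0.4)
logloss_fn(y_toy, p_toy)

# A confident wrong prediction is heavily penalized, but stays finite
logloss_fn(1, 0)
```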
# Feature importance (gain-based)
imp <- xgb.importance(model = xgb_model)
print(head(imp, 20))
## Feature Gain Cover Frequency
## <char> <num> <num> <num>
## 1: opp_def_BLK_pct_before 0.09208077 0.109934475 0.10253456
## 2: opp_def_STL_rate_before 0.08273674 0.035058794 0.10368664
## 3: opp_def_DRTG_before 0.07796805 0.089150418 0.08525346
## 4: opp_off_ORTG_before 0.07658510 0.041327081 0.08755760
## 5: opp_off_TOV_pct_before 0.06801714 0.056201328 0.07834101
## 6: factor(team)DET 0.06755210 0.094276116 0.02304147
## 7: opp_off_OREB_pct_before 0.06572309 0.046530466 0.07949309
## 8: opp_off_PPA_before 0.06274678 0.030405600 0.07603687
## 9: opp_off_AST_pct_before 0.05636446 0.030881886 0.07603687
## 10: scale(travel_km) 0.04246523 0.024654053 0.07027650
## 11: factor(season)2019 0.03616496 0.003091843 0.02188940
## 12: factor(team)DEN 0.02757436 0.054210137 0.01267281
## 13: factor(team)HOU 0.02666309 0.049261366 0.01152074
## 14: factor(team)MIL 0.02445965 0.045026194 0.01036866
## 15: factor(team)SAC 0.02139142 0.058457442 0.01497696
## 16: factor(season)2020 0.02120323 0.002580288 0.01958525
## 17: factor(team)BOS 0.01841763 0.044471013 0.01152074
## 18: factor(team)PHX 0.01706500 0.038889010 0.01036866
## 19: factor(team)CHA 0.01690314 0.031449264 0.01036866
## 20: factor(season)2021 0.01461757 0.004378490 0.01497696
## Feature Gain Cover Frequency
# Plot
xgb.plot.importance(imp[1:min(20, nrow(imp)), ])
# Baseline values for the counterfactual: setting days_rest and travel_km to
# their sample means makes the scale() terms equal zero in the prediction
sc_days <- attr(scale(model_data$days_rest), "scaled:center")
sc_travel <- attr(scale(model_data$travel_km), "scaled:center")
model_data <- model_data %>%
mutate(
# actual
p_actual = predict(model_main, newdata = ., type = "response", re.form = NA),
# counterfactual
p_cf = predict(
model_main,
newdata = mutate(.,
home = 0,
b2b = 0,
days_rest = sc_days,
travel_km = sc_travel,
time_zone_change = 0),
type = "response",
re.form = NA
),
schedule_effect = p_actual - p_cf
)
team_schedule_wins <- model_data %>%
group_by(team) %>%
summarise(schedule_wins = sum(schedule_effect, na.rm = TRUE)) %>%
arrange(desc(schedule_wins))
most_helped <- team_schedule_wins %>% slice_max(schedule_wins, n = 1)
most_hurt <- team_schedule_wins %>% slice_min(schedule_wins, n = 1)
cat(sprintf(
"Most Helped by Schedule: %s (%.1f wins)\nMost Hurt by Schedule: %s (%.1f wins)\n",
most_helped$team, most_helped$schedule_wins,
most_hurt$team, most_hurt$schedule_wins
))
## Most Helped by Schedule: WAS (15.1 wins)
## Most Hurt by Schedule: POR (12.4 wins)
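The schedule effect above is just the gap between the predicted win probability under the actual schedule and under a neutral baseline. A hand-rolled base-R sketch of that p_actual - p_cf arithmetic, using made-up coefficients (not the fitted values):

```r
# Hypothetical logit coefficients, for illustration only
b_int <- 0.0; b_home <- 0.43; b_b2b <- -0.26

# Actual game: at home, on the second night of a back-to-back
p_actual <- plogis(b_int + b_home * 1 + b_b2b * 1)

# Counterfactual baseline: road game, rested
p_cf <- plogis(b_int + b_home * 0 + b_b2b * 0)

schedule_effect <- p_actual - p_cf
schedule_effect  # per-game effect; summing over a season gives "schedule wins"
```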
Model explanation:
Game outcomes are estimated with a mixed-effects logistic regression, with win probability as the dependent variable. The model includes the key schedule covariates (back-to-back status, days of rest, travel distance, and time-zone changes) and opponent strength (season-to-date offensive and defensive ratings), plus season fixed effects and team random intercepts. I excluded cumulative opponent metrics such as assist rate, turnover rate, and block rate from the regression because they were highly collinear with the core offensive/defensive ratings and destabilized the fit; this choice prioritizes interpretability and convergence while retaining the most meaningful measures of schedule burden and opponent quality. As a robustness check, an XGBoost model that does include those extra metrics produced broadly similar variable-importance rankings, again highlighting opponent defense and travel distance. Diagnostics (convergence, feature importance, and residual patterns) indicate stable estimates, supporting the reliability of the findings.
This project is not meant to be especially rigorous, and there may be mistakes here and there. It's just for fun! I wanted to see what I could do with my data science skills by applying them to something I'm interested in, and I had a lot of fun making it :)